Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


unweighted pair group method with arithmetic mean (UPGMA), weighted pair group method with arithmetic mean (WPGMA), and neighbor joining (NJ) [14].

Both UPGMA and WPGMA assume a constant molecular clock, that is, that all lineages accumulate mutations at the same rate, so that pairwise distances reflect evolutionary divergence. The molecular clock is defined as the average rate at which a sequence accumulates mutations. The two methods share essentially the same algorithm. Each representative sequence starts as a cluster on its own; the two closest clusters are joined, and the distances from the new cluster to every remaining cluster are recalculated as an average. These steps are repeated until all sequences are connected in a single cluster. The methods differ in how that average is taken. In UPGMA, every sequence carries equal weight, so the distances of the two joined clusters are averaged in proportion to the number of sequences each contains. In WPGMA, the distances of the two joined clusters are averaged directly regardless of cluster size, so sequences in the smaller cluster effectively carry more weight.
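
The difference between the two averaging rules can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the function name `pair_group`, the five labels, and the distance values are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from itertools import combinations

def pair_group(names, dist, weighted=False):
    """Join the closest clusters repeatedly and return the list of merges.

    weighted=False averages in proportion to cluster size (UPGMA);
    weighted=True takes the plain mean of the two clusters (WPGMA).
    """
    size = {n: 1 for n in names}                      # sequences per cluster
    d = {frozenset(k): float(v) for k, v in dist.items()}
    merges = []
    while len(size) > 1:
        # pick the pair of current clusters with the smallest distance
        a, b = min(combinations(sorted(size), 2), key=lambda p: d[frozenset(p)])
        new = f"({a},{b})"
        for other in size:
            if other in (a, b):
                continue
            da, db = d[frozenset((a, other))], d[frozenset((b, other))]
            if weighted:
                d[frozenset((new, other))] = (da + db) / 2
            else:
                d[frozenset((new, other))] = (size[a] * da + size[b] * db) / (size[a] + size[b])
        merges.append((a, b, d[frozenset((a, b))]))
        size[new] = size[a] + size[b]
        del size[a], size[b]
    return merges

# Hypothetical pairwise distances among five representative sequences
names = ["a", "b", "c", "d", "e"]
dist = {("a", "b"): 17, ("a", "c"): 21, ("a", "d"): 31, ("a", "e"): 23,
        ("b", "c"): 30, ("b", "d"): 34, ("b", "e"): 21,
        ("c", "d"): 28, ("c", "e"): 39, ("d", "e"): 43}

upgma = pair_group(names, dist)
wpgma = pair_group(names, dist, weighted=True)
for step_u, step_w in zip(upgma, wpgma):
    print(step_u, step_w)
```

With these values both methods join the same clusters in the same order, but the distance at which the last two clusters meet differs, because the cluster sizes enter the UPGMA average and not the WPGMA one.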

The NJ algorithm does not assume a molecular clock and adjusts for rate variation among branches. It begins with an unresolved, star-like tree in which every representative sequence is attached to a single central node. The distance between each pair of sequences is evaluated and corrected for how far each member of the pair is, on average, from all other sequences. The pair of neighbors with the smallest corrected distance is joined first: a new internal node is created for the pair, and a branch is inserted between that node and the rest of the star-like tree. The distances from the new node to the remaining sequences are then recalculated from the distances of the joined pair. This process is repeated, treating each new node as a terminal, until the star is fully resolved.
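
As a rough sketch of a single NJ joining step, the Python below uses the standard rate-corrected criterion Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k) to choose which pair to join. The function name `nj_first_join` and the distance matrix are hypothetical, and the code performs only the first join, not the full tree construction.

```python
def nj_first_join(names, D):
    """Return the first pair joined by NJ, its two branch lengths, and the
    distances from the new internal node to every remaining sequence."""
    n = len(names)
    row_sum = [sum(row) for row in D]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # rate-corrected criterion: smaller Q means better neighbors
            q = (n - 2) * D[i][j] - row_sum[i] - row_sum[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    # branch lengths from each joined sequence to the new internal node
    len_i = 0.5 * D[i][j] + (row_sum[i] - row_sum[j]) / (2 * (n - 2))
    len_j = D[i][j] - len_i
    # distances from the new node to the sequences that were not joined
    to_rest = {names[k]: 0.5 * (D[i][k] + D[j][k] - D[i][j])
               for k in range(n) if k not in (i, j)}
    return names[i], names[j], len_i, len_j, to_rest

# Hypothetical distance matrix for five sequences
names = ["a", "b", "c", "d", "e"]
D = [[0, 5, 9, 9, 8],
     [5, 0, 10, 10, 9],
     [9, 10, 0, 8, 7],
     [9, 10, 8, 0, 3],
     [8, 9, 7, 3, 0]]

first_join = nj_first_join(names, D)
print(first_join)
```

In this matrix the smallest raw distance is between d and e, yet the criterion selects a and b, which shows how NJ accounts for unequal rates instead of simply joining the closest pair.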

The tree construction methods described briefly above are distance-based and are computationally less expensive than character-based methods. The character-based methods, including maximum parsimony (MP) and maximum likelihood (ML), make use of all known evolutionary information (individual substitutions) to determine the most likely ancestral relationships. Refer to a textbook on phylogenetics for more details about the various tree construction methods.

A phylogenetic tree is either rooted (with a common ancestor for all sequences) or unrooted (without a designated common ancestor). Unrooted trees are constructed when we do not assume that the molecular clock holds; they reflect only the relationships among the representative sequences, not the evolutionary path. However, if we can assume that sequences evolve at rates that remain constant through time across the different lineages, then the root of the tree can be estimated as the midpoint of the longest span across the tree (midpoint rooting).
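
Midpoint rooting can be illustrated with a small Python sketch. It assumes the unrooted tree is given as a list of (node, node, branch length) edges; the function name `midpoint_root`, the node labels, and the branch lengths are hypothetical. The code finds the two most distant terminals and reports the branch, and the offset along it, where the midpoint of that span falls.

```python
from collections import defaultdict

def midpoint_root(edges):
    """Return (leaf_a, leaf_b, span, (u, v, offset_from_u)) for the midpoint
    of the longest leaf-to-leaf path in an unrooted tree."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    leaves = sorted(n for n in adj if len(adj[n]) == 1)

    def paths_from(start):
        # for every node, record the path from start with cumulative distances
        out = {start: [(start, 0.0)]}
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt, w in adj[node]:
                if nxt not in out:
                    out[nxt] = out[node] + [(nxt, out[node][-1][1] + w)]
                    stack.append(nxt)
        return out

    best = None
    for a in leaves:
        paths = paths_from(a)
        for b in leaves:
            if b > a:
                span = paths[b][-1][1]
                if best is None or span > best[0]:
                    best = (span, a, b, paths[b])
    span, a, b, path = best
    half = span / 2
    # walk along the longest path until the branch containing the midpoint
    for (u, du), (v, dv) in zip(path, path[1:]):
        if du <= half <= dv:
            return a, b, span, (u, v, half - du)

# Hypothetical unrooted tree: terminals A-D, internal nodes X and Y
edges = [("A", "X", 1), ("B", "X", 2), ("X", "Y", 3), ("C", "Y", 6), ("D", "Y", 1)]
root_position = midpoint_root(edges)
print(root_position)
```

Here the longest span runs from B to C, and the estimated root lies on the branch between Y and C, close to Y.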

7.2.5 Microbial Diversity Analysis

Microbial diversity is calculated from the feature table, obtained in the denoising step above, to describe the microbes present within individual samples and how they differ between samples. The diversity of the microbial community within a sample is called alpha diversity, while the similarity or dissimilarity of the microbial communities in two samples is called beta diversity. Alpha diversity metrics include Shannon’s diversity index, observed features (richness), Faith’s phylogenetic diversity, and evenness. Beta diversity metrics include Jaccard distance, Bray–Curtis distance, and unweighted UniFrac distance.
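
To make the non-phylogenetic metrics concrete, the following Python sketch computes them for two samples directly from their feature counts. The counts and function names are hypothetical, the Shannon index is computed with a base-2 logarithm (an assumption; tools differ in the base they use), and evenness is taken to be Pielou's evenness. Faith's phylogenetic diversity and UniFrac are omitted because they also require a phylogenetic tree.

```python
import math

def shannon(counts):
    """Shannon's diversity index (log base 2 assumed here)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def observed_features(counts):
    """Number of features with a non-zero count (richness)."""
    return sum(1 for c in counts if c > 0)

def pielou_evenness(counts):
    """Shannon index divided by its maximum for the observed richness."""
    s = observed_features(counts)
    # evenness is undefined when only one feature is present
    return shannon(counts) / math.log2(s) if s > 1 else float("nan")

def jaccard_distance(x, y):
    """1 - shared features / all features, using presence or absence only."""
    both = sum(1 for a, b in zip(x, y) if a > 0 and b > 0)
    either = sum(1 for a, b in zip(x, y) if a > 0 or b > 0)
    return 1 - both / either

def bray_curtis(x, y):
    """Abundance-based dissimilarity between two samples."""
    return sum(abs(a - b) for a, b in zip(x, y)) / (sum(x) + sum(y))

# Hypothetical feature table: one row per sample, one column per feature
sample1 = [10, 10, 0, 0]
sample2 = [5, 5, 10, 0]

for name, s in (("sample1", sample1), ("sample2", sample2)):
    print(name, shannon(s), observed_features(s), pielou_evenness(s))
print(jaccard_distance(sample1, sample2), bray_curtis(sample1, sample2))
```

The first three functions describe a single sample (alpha diversity), while the last two compare a pair of samples (beta diversity).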